Automated web scraping using Obsidian Web Clipper and Puppeteer.
Obsidian is a powerful note-taking system. Using Puppeteer and the Obsidian Web Clipper, you can automatically download web content into your vault!
This is a slightly unorthodox way of sharing my process for scraping websites with Obsidian, but the project was rather disjointed, so I figured it would be easier to explain in narrative form what each piece is meant to do.
Obsidian Web Clipper
The official Obsidian Web Clipper is amazing. The scraper I built uses this Chrome extension to handle parsing pages and sending the data to Obsidian. You can use it for lots of cool and interesting things.
Once you have the extension installed, you can go here to browse a bunch of templates, which will give you a feel for what the extension can do and how it works. For my recipe scraper, I customized a few pieces of the recipe template from that collection.
Problem: Automation
Once I had set up the template the way I wanted, I was ready to start scraping! Unfortunately, the clipper is intentionally designed to be activated manually; there is no built-in way to trigger it automatically. Because I wanted to scrape thousands of recipes, manually navigating to each page and triggering the extension would be far too much work.
I will spare you the long and boring details of how I worked this out. The short version: the only way I found to trigger the extension automatically was to build my own Chrome extension and connect it to a local build of the Obsidian extension, customized to accept external connections.
Chrome extension for auto trigger
The code for the custom Chrome extension is in this repo.
Steps to install it in Chrome:
- Clone the project
- Open the Chrome extension manager and enable developer mode
- Once enabled, press 'Load unpacked' and select the project's root folder (it should contain manifest.json)
The extension fires a message as soon as you open a URL that matches one of the patterns in the 'matches' section of the manifest.json file:
"content_scripts": [
{
"js": ["scripts/content.js"],
// Make sure to update this array to include any URLs you
// want to scrape
"matches": ["https://developer.chrome.com/docs/extensions/*",
"https://developer.chrome.com/docs/webstore/*"]
}
]
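The content script itself can be tiny. Here is a minimal sketch of what scripts/content.js might look like, assuming all it does is tell the background script that a matching page has loaded (the 'pageMatched' action name is my placeholder, not necessarily what the repo uses):

    // scripts/content.js (sketch)
    // Runs on every page that matches a pattern in manifest.json.
    // Its only job is to notify the background script that a target
    // page has loaded, so the clip can be triggered.
    chrome.runtime.sendMessage({ action: 'pageMatched', url: window.location.href });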
You will also want to make note of the id at the top of the background.js file. This needs to be set to the extension id of the local Obsidian Web Clipper that you 'customize' (more on that below).
    // This needs to be set to the id of the extension you want to trigger
    const id = 'lemfefnbebfkbajcafkjoklibjadafasg';
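Putting the pieces together, the background script just relays the content script's message to the clipper as an external message. A minimal sketch of that relay, assuming the placeholder action name from above ('triggerQuickClip' is the action the customized clipper listens for, shown in the next section):

    // background.js (sketch)
    chrome.runtime.onMessage.addListener((message) => {
      if (message.action === 'pageMatched') {
        // Forward the trigger to the customized Obsidian Web Clipper
        chrome.runtime.sendMessage(id, { action: 'triggerQuickClip' });
      }
    });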
Obsidian customizations
The Obsidian Web Clipper uses the browser commands and messaging APIs to trigger a clip whenever you hit the keyboard shortcut. You can hook into this without changing much code, which lets the clipper receive the trigger command from our custom extension above.
All you need to do is add the following code snippet at line 183 of the src/background.ts file.
    browser.runtime.onMessageExternal.addListener(
      (
        request: unknown,
        sender: browser.Runtime.MessageSender,
        sendResponse: (response?: any) => void
      ): true | undefined => {
        if (typeof request === "object" && request !== null) {
          const typedRequest = request as {
            action: string;
            isActive?: boolean;
            hasHighlights?: boolean;
            tabId?: number;
          };
          if (typedRequest.action === "triggerQuickClip") {
            browser.tabs
              .query({ active: true, currentWindow: true })
              .then((tabs) => {
                if (tabs[0]?.id) {
                  browser.action.openPopup();
                  // Give the popup a moment to open before sending the clip command
                  setTimeout(() => {
                    browser.runtime
                      .sendMessage({ action: "triggerQuickClip" })
                      .catch((error) => sendResponse({ success: error }));
                  }, 500);
                }
              });
            return true;
          }
          // For other actions that use sendResponse
          if (
            typedRequest.action === "extractContent" ||
            typedRequest.action === "ensureContentScriptLoaded" ||
            typedRequest.action === "getHighlighterMode" ||
            typedRequest.action === "toggleHighlighterMode"
          ) {
            return true;
          }
        }
        return undefined;
      }
    );
Once this has been changed, recompile the Obsidian Web Clipper and load the new custom version into Chrome as well (again via 'Load unpacked'). It may be useful to remove or disable the original Obsidian Web Clipper extension in Chrome while you do this.
You will need to load any of your custom templates and settings into the 'custom' web clipper. It is easiest to just export all settings from the 'real' web clipper and import them into the one you made.
After installing the custom web clipper, be sure to update the id in the trigger extension's background.js so it sends the trigger to the right place.
If you have done everything correctly, you should be able to visit any web page that matches your trigger extension's rules and have it automatically downloaded into Obsidian!
Final steps
Now that all of that is working, I needed a way to visit each page, pause for a moment to let Obsidian 'scrape' it, and then move on. I created a huge array of URLs that I wanted to scrape and used a short Puppeteer script to manage this:
    const puppeteer = require('puppeteer');

    // The big list of URLs to scrape
    const cleanedLinks = [
      // ...
    ];

    const asyncWait = async (time) => {
      await new Promise((resolve) => setTimeout(resolve, time));
    };

    (async () => {
      // MAC: /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --no-first-run --no-default-browser-check --user-data-dir=$(mktemp -d -t 'chrome-remote_data_dir')
      // PC: start chrome.exe --remote-debugging-port=9222
      // Note: this url changes each time the command is run.
      const wsChromeEndpointUrl = 'YOUR_URL_HERE';

      const browser = await puppeteer.connect({
        browserWSEndpoint: wsChromeEndpointUrl,
      });

      for (let i = 0; i < cleanedLinks.length; i++) {
        const page = await browser.newPage();
        console.log(i, cleanedLinks[i]);
        await page.goto(cleanedLinks[i], {
          waitUntil: 'domcontentloaded',
        });
        // Pause ~2-3 seconds so the clipper has time to save the page
        await asyncWait(2000);
        await asyncWait(Math.random() * 1000);
        await page.close();
      }
    })();
You have to connect Puppeteer to an existing instance of Chrome; that is what the wsChromeEndpointUrl is for.
If you run the relevant command from the comments above (MAC or PC) in a terminal, it should open a new Chrome instance, and the WebSocket URL you need will be in the terminal output. (Make sure not to close the Chrome instance the terminal opened until you are done, as the URL changes each time a new one is opened.)
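If you would rather not copy the URL by hand, Chrome also exposes it over its DevTools HTTP endpoint. A small sketch, assuming Node 18+ for the built-in fetch and that it runs inside the async IIFE above:

    // Ask the running Chrome instance for its WebSocket endpoint.
    // /json/version is part of the Chrome DevTools protocol.
    const res = await fetch('http://127.0.0.1:9222/json/version');
    const { webSocketDebuggerUrl } = await res.json();
    // Use this in place of the hard-coded wsChromeEndpointUrl
    const browser = await puppeteer.connect({ browserWSEndpoint: webSocketDebuggerUrl });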
Install both custom extensions into the newly launched Chrome instance, run the Puppeteer script, and watch as your scraper begins loading content into your library automatically!